Implement SSH connection pool for runner instances #3936
Merged
un-def approved these changes on Jun 8, 2026
Part of #3920
Closes #2933
Add InstanceConnectionPool/InstanceConnection classes that allow re-using SSH connections to runner instances for shim and runner API port forwarding. Previously, the dstack server had to constantly re-establish SSH connections, which increased CPU load and slowed down processing. The runner_ssh_tunnel() decorator is updated to use the pool, so clients are mostly unchanged.
Impact
The run startup time (#3920) on a provisioned instance with a pulled image went from ~7s to 1-2s (it was mostly limited by SSH connection re-creation).
Also, CPU utilization on the dstack server machine no longer spikes from constantly opening many SSH connections to many instances (#2933).
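The reuse idea behind the pool can be illustrated with a minimal sketch. This is not the code from this PR: FakeConnection, SimpleConnectionPool, and with_pooled_connection are hypothetical names, no real SSH connection is opened, and the actual InstanceConnectionPool and runner_ssh_tunnel() may work quite differently. The sketch only shows how a decorator can hand callers an already open connection so that repeated calls to the same host do not reconnect.

```python
import threading
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Dict


@dataclass
class FakeConnection:
    """Hypothetical stand-in for an SSH connection; opens no real socket."""

    host: str
    is_open: bool = True

    def close(self) -> None:
        self.is_open = False


class SimpleConnectionPool:
    """Keeps at most one open connection per host and reuses it on demand."""

    def __init__(self, connect: Callable[[str], FakeConnection]) -> None:
        self._connect = connect
        self._connections: Dict[str, FakeConnection] = {}
        self._lock = threading.Lock()
        self.opened = 0  # number of connections actually established

    def get(self, host: str) -> FakeConnection:
        # The lock is held while connecting, which keeps the sketch simple
        # but would serialize connection setup in a real server.
        with self._lock:
            conn = self._connections.get(host)
            if conn is None or not conn.is_open:
                conn = self._connect(host)
                self._connections[host] = conn
                self.opened += 1
            return conn

    def close_all(self) -> None:
        with self._lock:
            for conn in self._connections.values():
                conn.close()
            self._connections.clear()


def with_pooled_connection(pool: SimpleConnectionPool):
    """Decorator that replaces the host argument with a pooled connection."""

    def decorator(func):
        @wraps(func)
        def wrapper(host: str, *args, **kwargs):
            return func(pool.get(host), *args, **kwargs)

        return wrapper

    return decorator


pool = SimpleConnectionPool(connect=FakeConnection)


@with_pooled_connection(pool)
def healthcheck(conn: FakeConnection) -> str:
    return f"ok via {conn.host}"
```

Calling healthcheck("10.0.0.1") twice establishes only one connection in this sketch; a second connection is created only for a different host or after the first one has been closed.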
Notes and implementation details
[...] DSTACK_SERVER_SSH_POOL_ENABLED. The plan is to test the pool in the next release, then enable the pool by default, and document how to opt out if RAM usage is a concern.
([...] inactivity_duration feature distinguishes user and server connections based on duration.)
[...] DSTACK_SERVER_DIR. It's expected that the pool is disabled in such setups. (It's already kinda half-working with gateway connections.)
[...] runner_ssh_tunnel() incl. retries=3. Retries seem to be legacy here and are no longer needed after "Introduce JOB_DISCONNECTED_RETRY_TIMEOUT" #2627 (2m timeout before a running job is kicked from an unreachable instance). Added and documented the DSTACK_SERVER_SSH_CONNECT_TIMEOUT env var to increase the default ConnectTimeout if server-instance latency is always >3s.
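For illustration, here is a hedged sketch of how the two environment variables mentioned above could be read and turned into an OpenSSH option. The helper names, the set of accepted truthy values, and the 3-second default are assumptions made for this example, not the parsing used in dstack. ConnectTimeout itself is a standard OpenSSH client option.

```python
import os
from typing import List, Mapping

# Assumed default for this sketch; the notes above only imply that the
# default is too low when latency is always above 3 seconds.
DEFAULT_CONNECT_TIMEOUT = 3


def pool_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if the pool is switched on (assumed truthy values)."""
    raw = env.get("DSTACK_SERVER_SSH_POOL_ENABLED", "")
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def connect_timeout(env: Mapping[str, str] = os.environ) -> int:
    """Return the SSH connect timeout in seconds, falling back to the default."""
    raw = env.get("DSTACK_SERVER_SSH_CONNECT_TIMEOUT")
    if raw is None or not raw.strip():
        return DEFAULT_CONNECT_TIMEOUT
    value = int(raw)
    if value <= 0:
        raise ValueError("DSTACK_SERVER_SSH_CONNECT_TIMEOUT must be positive")
    return value


def ssh_options(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the ssh command-line options that carry the timeout."""
    return ["-o", f"ConnectTimeout={connect_timeout(env)}"]
```

With DSTACK_SERVER_SSH_CONNECT_TIMEOUT=10 set, ssh_options() in this sketch returns the arguments for ssh -o ConnectTimeout=10.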